{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parameter Learning in Discrete Bayesian Networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we show an example of learning the parameters (CPDs) of a Discrete Bayesian Network given the data and the model structure. pgmpy has three main methods for learning the parameters:\n", "1. Maximum Likelihood Estimator (pgmpy.estimators.MaximumLikelihoodEstimator)\n", "2. Bayesian Estimator (pgmpy.estimators.BayesianEstimator)\n", "3. Expectation Maximization (pgmpy.estimators.ExpectationMaximization)\n", "\n", "In the examples, we generate data from a given model and then try to learn the model parameters back from the generated data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Generate some data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Generating for node: CVP: 100%|██████████| 37/37 [00:01<00:00, 24.08it/s] \n" ] }, { "data": { "text/html": [ "
[HTML table preview of samples.head(): 5 rows × 37 columns; the text/plain output that follows shows the same rows]
" ], "text/plain": [ " HISTORY CVP PCWP HYPOVOLEMIA LVEDVOLUME LVFAILURE STROKEVOLUME \\\n", "0 FALSE NORMAL NORMAL FALSE NORMAL FALSE NORMAL \n", "1 FALSE NORMAL NORMAL FALSE NORMAL FALSE NORMAL \n", "2 FALSE LOW LOW TRUE LOW TRUE LOW \n", "3 FALSE NORMAL NORMAL FALSE NORMAL FALSE NORMAL \n", "4 FALSE HIGH HIGH TRUE HIGH FALSE NORMAL \n", "\n", " ERRLOWOUTPUT HRBP HREKG ... MINVOLSET VENTMACH VENTTUBE VENTLUNG \\\n", "0 FALSE HIGH HIGH ... NORMAL NORMAL LOW ZERO \n", "1 TRUE LOW LOW ... NORMAL NORMAL LOW ZERO \n", "2 FALSE HIGH NORMAL ... NORMAL NORMAL ZERO LOW \n", "3 FALSE HIGH HIGH ... NORMAL NORMAL LOW ZERO \n", "4 TRUE NORMAL HIGH ... NORMAL NORMAL ZERO HIGH \n", "\n", " VENTALV ARTCO2 CATECHOL HR CO BP \n", "0 ZERO HIGH HIGH HIGH HIGH HIGH \n", "1 ZERO HIGH HIGH LOW LOW LOW \n", "2 HIGH LOW HIGH HIGH LOW LOW \n", "3 ZERO HIGH HIGH HIGH HIGH HIGH \n", "4 LOW HIGH HIGH HIGH HIGH HIGH \n", "\n", "[5 rows x 37 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the alarm model to generate data from it.\n", "\n", "from pgmpy.utils import get_example_model\n", "from pgmpy.sampling import BayesianModelSampling\n", "\n", "alarm_model = get_example_model(\"alarm\")\n", "samples = BayesianModelSampling(alarm_model).forward_sample(size=int(1e5))\n", "samples.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Define a model structure\n", "\n", "In this case, since we are trying to learn the model parameters back we will use the model structure that we used to generate the data from." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NodeView(('HYPOVOLEMIA', 'LVEDVOLUME', 'STROKEVOLUME', 'CVP', 'PCWP', 'LVFAILURE', 'HISTORY', 'CO', 'ERRLOWOUTPUT', 'HRBP', 'ERRCAUTER', 'HREKG', 'HRSAT', 'INSUFFANESTH', 'CATECHOL', 'ANAPHYLAXIS', 'TPR', 'BP', 'KINKEDTUBE', 'PRESS', 'VENTLUNG', 'FIO2', 'PVSAT', 'SAO2', 'PULMEMBOLUS', 'PAP', 'SHUNT', 'INTUBATION', 'MINVOL', 'VENTALV', 'DISCONNECT', 'VENTTUBE', 'MINVOLSET', 'VENTMACH', 'EXPCO2', 'ARTCO2', 'HR'))" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Defining the Bayesian Model structure\n", "\n", "from pgmpy.models import BayesianNetwork\n", "\n", "model_struct = BayesianNetwork(ebunch=alarm_model.edges())\n", "model_struct.nodes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Learning the model parameters " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------+---------+\n", "| FIO2(LOW) | 0.04859 |\n", "+--------------+---------+\n", "| FIO2(NORMAL) | 0.95141 |\n", "+--------------+---------+\n", "+-------------+----------------------+-----------------------+----------------------+\n", "| LVEDVOLUME | LVEDVOLUME(HIGH) | LVEDVOLUME(LOW) | LVEDVOLUME(NORMAL) |\n", "+-------------+----------------------+-----------------------+----------------------+\n", "| CVP(HIGH) | 0.702671646078713 | 0.0069145318521877126 | 0.010257212769589711 |\n", "+-------------+----------------------+-----------------------+----------------------+\n", "| CVP(LOW) | 0.009480034472852629 | 0.9526184538653366 | 0.03999032606840039 |\n", "+-------------+----------------------+-----------------------+----------------------+\n", "| CVP(NORMAL) | 0.28784831944843436 | 0.04046701428247563 | 0.94975246116201 |\n", "+-------------+----------------------+-----------------------+----------------------+\n" ] }, { 
"data": { "text/plain": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fitting the model using Maximum Likelihood Estimator\n", "\n", "from pgmpy.estimators import MaximumLikelihoodEstimator\n", "\n", "mle = MaximumLikelihoodEstimator(model=model_struct, data=samples)\n", "\n", "# Estimating the CPD for a single node.\n", "print(mle.estimate_cpd(node=\"FIO2\"))\n", "print(mle.estimate_cpd(node=\"CVP\"))\n", "\n", "# Estimating CPDs for all the nodes in the model\n", "mle.get_parameters()[:10] # Show just the first 10 CPDs in the output" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Verifying that the learned parameters are almost equal.\n", "np.allclose(\n", " alarm_model.get_cpds(\"FIO2\").values, mle.estimate_cpd(\"FIO2\").values, atol=0.01\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------+-----------+\n", "| FIO2(LOW) | 0.0530594 |\n", "+--------------+-----------+\n", "| FIO2(NORMAL) | 0.946941 |\n", "+--------------+-----------+\n", "+-------------+----------------------+----------------------+----------------------+\n", "| LVEDVOLUME | LVEDVOLUME(HIGH) | LVEDVOLUME(LOW) | LVEDVOLUME(NORMAL) |\n", "+-------------+----------------------+----------------------+----------------------+\n", "| CVP(HIGH) | 0.6974417067875012 | 0.017649638237228676 | 0.011630213055303717 |\n", "+-------------+----------------------+----------------------+----------------------+\n", "| CVP(LOW) | 0.014065892570565468 | 0.9322516991887744 | 0.041236967361740706 |\n", "+-------------+----------------------+----------------------+----------------------+\n", "| CVP(NORMAL) | 0.2884924006419334 | 
0.05009866257399693 | 0.9471328195829556 |\n", "+-------------+----------------------+----------------------+----------------------+\n" ] }, { "data": { "text/plain": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fitting the model using the Bayesian Estimator\n", "from pgmpy.estimators import BayesianEstimator\n", "\n", "best = BayesianEstimator(model=model_struct, data=samples)\n", "\n", "print(best.estimate_cpd(node=\"FIO2\", prior_type=\"BDeu\", equivalent_sample_size=1000))\n", "# Uniform pseudo count for each state. Can also accept an array of the size of the CPD.\n", "print(best.estimate_cpd(node=\"CVP\", prior_type=\"dirichlet\", pseudo_counts=100))\n", "\n", "# Learning CPDs for all the nodes in the model with a BDeu prior. To learn all parameters\n", "# with a Dirichlet prior instead, a dict of pseudo_counts needs to be provided.\n", "best.get_parameters(prior_type=\"BDeu\", equivalent_sample_size=1000)[:10]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------+---------+\n", "| FIO2(LOW) | 0.04859 |\n", "+--------------+---------+\n", "| FIO2(NORMAL) | 0.95141 |\n", "+--------------+---------+\n", "+--------------+-----------+\n", "| FIO2(LOW) | 0.0530594 |\n", "+--------------+-----------+\n", "| FIO2(NORMAL) | 0.946941 |\n", "+--------------+-----------+\n" ] } ], "source": [ "# Shortcut for learning all the parameters and adding the CPDs to the model.\n", "\n", "model_struct = BayesianNetwork(ebunch=alarm_model.edges())\n", "model_struct.fit(data=samples, estimator=MaximumLikelihoodEstimator)\n", "print(model_struct.get_cpds(\"FIO2\"))\n", "\n", "model_struct = BayesianNetwork(ebunch=alarm_model.edges())\n", "model_struct.fit(\n", " data=samples,\n", " estimator=BayesianEstimator,\n", " prior_type=\"BDeu\",\n", " equivalent_sample_size=1000,\n", ")\n", 
"print(model_struct.get_cpds(\"FIO2\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Expecation Maximization (EM) algorithm can also learn the parameters when we have some latent variables in the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 11%|█ | 11/100 [28:03<3:46:14, 152.52s/it]" ] } ], "source": [ "from pgmpy.estimators import ExpectationMaximization as EM\n", "\n", "# Define a model structure with latent variables\n", "model_latent = BayesianNetwork(\n", " ebunch=alarm_model.edges(), latents=[\"HYPOVOLEMIA\", \"LVEDVOLUME\", \"STROKEVOLUME\"]\n", ")\n", "\n", "# Dataset for latent model which doesn't have values for the latent variables\n", "samples_latent = samples.drop(model_latent.latents, axis=1)\n", "\n", "model_latent.fit(samples_latent, estimator=EM)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 1 }